This assignment is for ETC5521 Assignment 1 by Team wattle, comprising Ketan Kabu 31554679, Yiwen Zhang 31203019, Hanh Ngo 31196101, and Xue Wang 30032245.
Women’s soccer has gained a lot of media attention in the last few years, and deservedly so. Countries like the USA, a dominant force in women’s soccer, have produced top-quality players such as Megan Rapinoe, Carli Lloyd, Alex Morgan and Julie Ertz. Just north of the USA, Christine Sinclair from Canada became the first soccer player to win the Lou Marsh Award as Canadian Athlete of the Year, which is unsurprising considering her outstanding haul of 182 career international goals. Ada Hegerberg from Norway became the first woman to win the prestigious Ballon d’Or, a yearly honor awarded to the best player in the world, which was first introduced for the women’s game in 2018 (SportMob 2020).
The advancement of women’s soccer has run parallel with the understanding and acceptance of data and analytics in the game. The technology for capturing and analysing data has advanced rapidly since it first appeared in the late 1990s. Clubs have used this technology to identify and scout talent from all over the world.
Every country has a different approach to coaching and developing talent. Success at an international level could come down to the ability of world-class players to play with a certain level of chemistry, which can be challenging because they commonly play for rival clubs at the domestic level. In the men’s game, Italy, Spain and Germany won the World Cups in 2006, 2010 and 2014 respectively. Many attributed the success of these teams partly to the fact that the majority of their players played in the same domestic league (or country), or even at the same club, especially at the clubs that produce the most successful international players.
In this paper, we will approach the women’s game with the same lens. We believe that these findings could prove meaningful to coaches, scouts and even young players in the game looking to choose clubs to play for with future international success in mind.
The squads data set also provides each player’s position on the field: Forward, Midfielder, Defender or Goalkeeper. Each position can be judged on various other parameters, such as assists, distance covered in a match, tackles, clearances and saves (for goalkeepers). If this information were available, better insights into player performance could be achieved. However, we only have goals information, which may not do justice to any position other than the Forwards. For the sake of this report, we will forgo the analysis of player position, as such an analysis would not be very meaningful with the current level of information.
We could also do a match-by-match analysis of the World Cup if we could obtain information about how each player was rated, or how many goals she scored, in each World Cup match. In our clubs data, we have a value named “Unattached”, which will sometimes appear in our analysis. It signifies that the player was not contracted to any domestic club when the data was collected.
This dataset is about the Women’s World Cup and originally comes from the website data.world: https://data.world/sportsvizsunday/womens-world-cup-data. It consists of final score and win/loss status data from 1991 to 2019. Additionally, the 2019 World Cup rosters for each team are included. The data was cleaned by Thomas Mock for the TidyTuesday challenge, and we downloaded the two datasets from his GitHub at https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-07-09. Figure 2.1 shows the missing data for squads.
| Variable | Class | Description |
|---|---|---|
| year | double | Year of tournament |
| team | character | Abbreviated team |
| score | double | Score by team |
| round | character | Round of the tournament |
| yearly_game_id | integer | Grouping variable - pairs team 1 and team 2 by round/year |
| team_num | double | team num (1 or 2) |
| win_status | character | Win Status = win/lose or tie (group only) |
| Variable | Class | Description |
|---|---|---|
| squad_no | double | Squad number (1 through 23) |
| country | character | Country |
| pos | character | position |
| player | character | Player name |
| dob | date | date of birth |
| age | double | age (years) |
| caps | double | caps - international games played |
| goals | double | Goals - international goals scored |
| club | character | Professional Club |
For the “squads” dataset, we made two improvements:
- The missing values for caps and goals mostly belonged to the China PR, Nigeria and Cameroon teams. We managed to find the missing caps and goals for the China PR team on Wikipedia and added them.
- Aside from the 2019 squads information that was provided, we expanded the dataset to include the squads of the previous World Cups. This information was gathered manually from Wikipedia and stored in a csv file before being merged with the 2019 squads information. The new dataset has the following dimensions:
Rows: 2,870
Columns: 10
$ squad_no <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ country <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", …
$ pos <chr> "GK", "FW", "MF", "DF", "DF", "MF", "DF", "MF", "MF", "FW", …
$ player <chr> "Alyssa Naeher", "Mallory Pugh", "Sam Mewis", "Becky Sauerbr…
$ dob <dttm> 1988-04-20, 1998-04-29, 1992-10-09, 1985-06-06, 1988-08-04,…
$ age <dbl> 31, 21, 26, 34, 30, 26, 26, 27, 25, 36, 34, 20, 29, 25, 33, …
$ caps <dbl> 43, 50, 47, 155, 115, 82, 37, 79, 66, 271, 99, 19, 160, 31, …
$ goals <dbl> 0, 15, 9, 0, 2, 6, 0, 18, 8, 107, 1, 1, 101, 0, 44, 6, 28, 0…
$ club <chr> "Chicago Red Stars", "Washington Spirit", "North Carolina Co…
$ year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, …
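The merge described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the object names `squads_2019` and `squads_history`, the placeholder players and all values are hypothetical, and the report's own code may have used a different function to combine the files.

```r
# Hypothetical toy data standing in for the provided 2019 squads file and
# the manually collected historical csv; all values are illustrative only.
squads_2019 <- data.frame(
  squad_no = c(1, 2),
  country  = c("Country A", "Country A"),
  player   = c("Player 1", "Player 2"),
  caps     = c(43, 50),
  stringsAsFactors = FALSE
)
# The provided 2019 table has no year column, so we add one before stacking.
squads_2019$year <- 2019

squads_history <- data.frame(
  squad_no = c(13, 9),
  country  = c("Country A", "Country B"),
  player   = c("Player 3", "Player 4"),
  caps     = c(120, NA),          # older tournaments often lack caps
  year     = c(2007, 2011),
  stringsAsFactors = FALSE
)

# Stack the two tables row-wise; rbind() needs the same column names in both.
squads_cleaned <- rbind(squads_2019, squads_history)

dim(squads_cleaned)  # number of rows and columns after the merge
```

Adding the `year` column before stacking is what lets later steps filter by tournament.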
After we joined the two datasets, we once again checked the missing values in the newly created “squads_cleaned”.
Figure 2.1: Missing data for squads
Information on caps, goals and clubs was mostly unavailable for the earliest World Cups. In each analysis, we will keep only the years that have sufficient information to address the task at hand.
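One way to decide which years have enough information is to compute the share of missing values per year and keep the years below a cut-off. The sketch below uses base R on hypothetical toy data; the 50% threshold is an assumption chosen for illustration, not a value taken from our analysis.

```r
# Hypothetical toy version of squads_cleaned; values are illustrative only.
squads_cleaned <- data.frame(
  year  = c(1991, 1991, 2007, 2007, 2019, 2019),
  caps  = c(NA, NA, 10, 25, 43, NA),
  goals = c(NA, NA, 1, 4, 0, 2),
  club  = c(NA, NA, "Club A", "Club B", "Club C", "Club D"),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a value is missing.
is_missing <- is.na(squads_cleaned[, c("caps", "goals", "club")])

# Mean of a logical column = share of missing values, computed per year.
missing_share <- aggregate(
  as.data.frame(is_missing),
  by  = list(year = squads_cleaned$year),
  FUN = mean
)

# Assumed cut-off: keep years where at most half of the caps are missing.
threshold    <- 0.5
usable_years <- missing_share$year[missing_share$caps <= threshold]

squads_usable <- squads_cleaned[squads_cleaned$year %in% usable_years, ]
```

The same filter can be applied with a different column (for example `club`) depending on which variable an analysis needs.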
In the previous stage of this report, the wattle team performed a range of analyses on player contribution by club, goals scored, player caps and player position. Even though the analysis was thorough, the work lacked direction towards answering the primary question. In this improved version, we re-address the wattle team’s analysis by expanding the scope (obtaining more data) to direct the results towards answering the primary question.
Details of the changes are as follows:
Before we start, there are a couple of questions we need to bear in mind as guidelines for our analysis:
Primary question: What factors describe the characteristics of a World Cup participant and, if possible, of a World Cup champion?
How does a player’s domestic club influence her performance?
How do the age and caps of a player influence her performance?
Is there any relation between the age of a player, the caps she has earned and the goals she has scored?
Figure 4.1: Nation goals distribution
Figure 4.1 shows that the USA and Germany score markedly more goals each year than other countries. The shape of each distribution shows where a country’s goal counts are concentrated: the further right the bulk of the distribution lies, the more goals were scored. Each point shows the total number of goals a country scored in a given year. The United States scored the most.
Figure 4.2: Match results by country and stage, 1991-2019
Figure 4.2 shows how every country that took part in the Women’s World Cup between 1991 and 2019 progressed through each tournament. The color represents a win, loss or tie, and each row represents a different stage of the competition, such as group matches, semi-finals or the final. The United States has lost only four times in the past 20 years, which makes it a formidable opponent in the Women’s World Cup. Germany follows: although it has lost more often than the United States, it has also won many times. The figure makes clear that the strength of national teams in women’s football differs widely. Why does this happen? What factors drive these differences?
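The win, loss and tie counts behind a figure of this kind can be tabulated directly from the outcomes data. The sketch below is a minimal base R example on hypothetical rows; the team codes and the labels `"Won"`, `"Lost"` and `"Tie"` are assumptions about how `win_status` is coded.

```r
# Hypothetical toy outcomes: one row per team per match (illustrative only).
wwc_outcomes <- data.frame(
  team       = c("USA", "GER", "USA", "JPN", "USA", "JPN"),
  round      = c("Group", "Group", "Group", "Group", "Final", "Final"),
  win_status = c("Won", "Lost", "Tie", "Tie", "Lost", "Won"),
  stringsAsFactors = FALSE
)

# Cross-tabulate teams against results: rows are teams, columns are outcomes.
results <- table(wwc_outcomes$team, wwc_outcomes$win_status)
results

# Number of losses per team, sorted so the teams that lose least come first.
losses <- sort(results[, "Lost"])
```

Adding `wwc_outcomes$round` as a third argument to `table()` would break the same counts down by stage.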
Figure 4.3: Total goals scored by different countries
Figure 4.3 is an interactive map. Clicking on a location on the map shows the country’s name and the total number of goals it has scored in the World Cup. The color also indicates the number of goals. The USA, Germany and Norway have scored by far the most goals in total. Does the total number of goals have anything to do with the chance of winning the championship? Will the country with more goals win the tournament?
Figure 4.4: Correlation analysis
Figure 4.4 is a correlation analysis between total goals and the chance to win. The correlation coefficient is 0.768, which indicates a positive correlation: the higher the total number of goals, the greater the chance of winning the championship.
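A correlation of this kind can be computed as in the sketch below. Note the assumptions: the toy data is hypothetical, and "chance to win" is taken here to mean the share of matches a team won, which may differ from the definition used for Figure 4.4.

```r
# Hypothetical toy outcomes: one row per team per match (illustrative only).
wwc_outcomes <- data.frame(
  team       = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  score      = c(3, 2, 4, 1, 1, 2, 0, 0, 1, 0),
  win_status = c("Won", "Won", "Won", "Lost", "Won",
                 "Lost", "Lost", "Lost", "Tie", "Lost"),
  stringsAsFactors = FALSE
)

# Total goals per team, and share of matches won per team (assumed definition).
total_goals <- tapply(wwc_outcomes$score, wwc_outcomes$team, sum)
win_share   <- tapply(wwc_outcomes$win_status == "Won", wwc_outcomes$team, mean)

per_team <- data.frame(
  team        = names(total_goals),
  total_goals = as.numeric(total_goals),
  win_share   = as.numeric(win_share)
)

# Pearson correlation between the two per-team summaries.
r <- cor(per_team$total_goals, per_team$win_share)
r
```

With only a handful of teams the coefficient is sensitive to single countries, so it is best read alongside the scatter plot rather than on its own.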
To find out what factors could characterise a World Cup player, we looked at two elements of her career: the player herself and the club she plays for when she is not at the World Cup. In this section, we will look at the club first.
It is no surprise that some clubs or leagues are better represented among World Cup players than others. Those clubs are usually large and have more resources to produce quality players. In the table below, we examine the 12 clubs that are or were home to the highest numbers of World Cup players.
| club | Player |
|---|---|
| Lyon | 30 |
| Arsenal | 25 |
| Paris Saint-Germain | 25 |
| 1. FFC Frankfurt | 22 |
| 25-Apr | 22 |
| Barcelona | 22 |
| Rivers Angels | 21 |
| Chelsea | 18 |
| Vancouver Whitecaps | 18 |
| VfL Wolfsburg | 17 |
| Linköping | 16 |
| LSK Kvinner | 16 |
We filtered for the 2007 World Cup up to the 2019 World Cup (4 tournaments in total), as club information was not fully available for the earlier tournaments. We then counted the number of unique players that each club had sent to the World Cup.
Lyon contributed the highest number of players across the four World Cup tournaments. Besides Lyon, the French women’s first division had one more club in the top 10: Paris Saint-Germain. Most of the other clubs in the ranking belong to countries well known for their football history: Germany (1. FFC Frankfurt and VfL Wolfsburg), England (Arsenal and Chelsea), Norway (LSK Kvinner), Sweden (Linköping), Canada (Vancouver Whitecaps) and Spain (Barcelona).
Two special cases stand out in the chart: 25-Apr of North Korea and Rivers Angels of Nigeria. These two clubs are special in that their national World Cup teams consisted mostly of their players. We will revisit this point when we get to the breakdown of domestic and foreign players of these 12 clubs later in this section.
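The filtering and counting steps can be sketched in base R as follows. The toy data is hypothetical, and identifying a unique player by her name alone is an assumption of this sketch.

```r
# Hypothetical toy squads data: one row per player per tournament.
squads_cleaned <- data.frame(
  player = c("P1", "P1", "P2", "P3", "P3", "P4"),
  club   = c("Club A", "Club A", "Club A", "Club B", "Club B", "Club A"),
  year   = c(2011, 2015, 2015, 2003, 2007, 2003),
  stringsAsFactors = FALSE
)

# Keep the tournaments from 2007 onwards and drop rows without a club.
recent <- squads_cleaned[squads_cleaned$year >= 2007 &
                           !is.na(squads_cleaned$club), ]

# A player who appears for the same club in several tournaments counts once.
unique_pairs <- unique(recent[, c("club", "player")])

# Count unique players per club and list the largest clubs first.
players_per_club <- sort(table(unique_pairs$club), decreasing = TRUE)
head(players_per_club, 12)
```

A player who changed clubs between tournaments is counted once for each club under this approach.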
Figure 4.5: Clubs’ players by year from WC2007-WC2019
In terms of year, 5 of the 12 clubs featured in the chart did not send players to all four World Cups. The chart clearly shows that 25-Apr contributed an impressive number of players to the North Korea team in 2007 and 2011, which earned the club a place in our ranking even though it was absent from the two most recent tournaments.
Except for Chelsea and Barcelona, no club saw a significant increase in its number of World Cup players in 2019. Rather, most saw their numbers fall. There are a couple of possible explanations for this observation:
- 2015 was the first World Cup in which the number of participating teams increased, from 16 in 2011 to 24. As a result, most clubs saw their player numbers surge in 2015.
- Big clubs are often reluctant to send their most prominent players to the World Cup, preferring to reserve their stars for domestic competitions or other premier leagues. In men’s football, this practice threatened the success of the World Cup so badly that FIFA introduced a scheme to pay clubs for their “collaboration” (Homewood 2020). A drop in the number of female players could suggest two things: (1) women’s football clubs were also concerned about other competitions and did not want to release their star players, and (2) 2019 was the second World Cup that allowed 24 teams to participate, so other clubs might have used this chance to send their players onto the international stage, making each club’s share smaller.
For (1), we confirmed that FIFA announced its first benefits programme to reward clubs for releasing players to the Women’s World Cup. This programme was introduced ahead of the 2019 World Cup.
Finally in this club section, we will look at the composition of the players that each of these 12 clubs sent to the World Cup. Would they be playing alongside or against their club teammates?
Figure 4.6: Percentage of domestic players of the top 12 clubs
We can see that almost all of the World Cup players from Vancouver Whitecaps, 25-Apr and Rivers Angels represented the club’s own country. For other big names such as Chelsea, Arsenal, Lyon and 1. FFC Frankfurt, only 50% to 60% of their World Cup players were domestic; the rest played for other national teams. To better understand how such a composition contributes to World Cup performance, we were tempted to examine which country’s clubs held the highest number of players among all clubs featured in the World Cup, and how that related to performance. However, gathering the country of each club would require a lot of time and manual work, so we referred to a study by FIFA for this question.
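The domestic share can be computed once each club is mapped to a country. The sketch below is illustrative only: the lookup `club_country` is a hypothetical hand-built mapping (the squads data does not contain it), and the players are toy values.

```r
# Hypothetical toy data: the national team and club of each World Cup player.
players <- data.frame(
  player  = c("P1", "P2", "P3", "P4", "P5"),
  country = c("France", "France", "Germany", "Nigeria", "Nigeria"),
  club    = c("Club A", "Club A", "Club A", "Club B", "Club B"),
  stringsAsFactors = FALSE
)

# Assumed manual lookup from club to the country whose league it plays in.
club_country <- c("Club A" = "France", "Club B" = "Nigeria")

# A player is domestic when her national team matches her club's country.
players$domestic <- players$country == club_country[players$club]

# Percentage of domestic players per club.
pct_domestic <- 100 * tapply(players$domestic, players$club, mean)
pct_domestic
```

Building `club_country` for every club in the data is the manual step that made a full analysis impractical for us.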
Figure 4.7: Players owner in World Cup 2019
An article published on dw.com (Schacht 2019) reports FIFA figures showing that US clubs held roughly 14% of all World Cup 2019 players: 73 players. That figure is very impressive considering that hundreds of clubs from all over the world sent players to the tournament. Next came the familiar faces of Spain, France and England. The article also pointed out that Spain’s national women’s team qualified for the tournament for the first time in 2015. At that time, only 21 World Cup players belonged to top Spanish teams; by 2019, the number had increased to 51. The article suggested that this change came from a significant improvement in support for women’s football in Spain in recent years.
The dominance of the US and the rise of Spain reflect how well a nation supports its players. It is natural for players to move to the clubs that offer them the best conditions. The better the conditions, the more likely a club is to attract top players and, ultimately, the better the performance at any tournament. The United States has won 4 of the 8 Women’s World Cups held so far.
Next we will take a deeper look at the stars of the game: the players. In this section, we wanted to know whether any factors determine a player’s chance of participating in the World Cup or, going further, what these factors can tell us about her performance in the tournament.
First of all, we will look at the overall age distribution of all the players through all eight World Cup tournaments.
Figure 4.8: Player’s age distribution
We chose a binwidth of 1, meaning each bin represents 1 year of age. The red line indicates the average age of the players in each year. We can ignore the counts for each year, as the number of participating teams was not the same across all years. Instead we will concentrate on the shape of the distribution.
We can see that the age distribution varied quite significantly by year. From 1991 up to 2015, the age distribution was slightly right-skewed, meaning players clustered around the lower ages: below 30, at around 20 to 25, as the charts show. In 2019, however, a clear shift was observed, with ages clustering around the higher age of 27. More observations appeared beyond the age of 30, indicating that the players who qualify for the World Cup are getting older than before. We can also see that the mean age in 2019 was higher than in the other years.
Next up, we will comb through the ages of all participants who took part in the World Cup, from the very first tournament in 1991 to the most recent in 2019.
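The one-year binning and the per-year mean can be reproduced numerically as in this base R sketch. The ages are hypothetical toy values; in a ggplot2 chart the equivalent setting would be `geom_histogram(binwidth = 1)`.

```r
# Hypothetical toy ages for two tournaments (illustrative only).
ages <- data.frame(
  year = c(2015, 2015, 2015, 2019, 2019, 2019),
  age  = c(21, 23, 25, 27, 28, 32)
)

# Mean age per tournament (the quantity drawn as the red line).
mean_age <- tapply(ages$age, ages$year, mean)
mean_age

# One-year bins shared by all years so the panels are comparable.
breaks <- seq(floor(min(ages$age)), ceiling(max(ages$age)) + 1, by = 1)

# Bin counts for 2019 without drawing the plot; each bin is [a, a + 1).
counts_2019 <- hist(ages$age[ages$year == 2019],
                    breaks = breaks, right = FALSE, plot = FALSE)$counts
```

Dividing `counts_2019` by its sum gives proportions, which are easier to compare across years with different numbers of teams.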
Figure 4.9: Summary of player age by country
On average, a World Cup player is not younger than her 20s. The ideal age range for players appears to be around 25 to 27 years. The World Cup has also seen a fair share of mature players who were past 30 and, on occasion, even reached their 40s. Let’s locate the winners and the runners-up of all past World Cups and see how their ages varied.
We have the United States (4 wins, 1 runner-up), Norway (1 win, 1 runner-up), Germany (2 wins, 1 runner-up), Japan (1 win, 1 runner-up), and China PR, Sweden, Brazil and the Netherlands (1 runner-up each). We have highlighted these countries in the chart above for easier reference. Most of them sit in the right half of the chart, indicating a higher-than-average median age of players. The 4-time champion US even had a long right tail, with a maximum age of 39. Other highlighted countries also had some players whose ages were outliers above 35. Japan and China are in the other half of the chart, with China appearing to have younger players than the others. Japan, even though in the other half, had a long right tail in its age distribution, with players as old as 36.
The number of caps represents the number of times a player has represented her national team in an international match.
Table 4.2 below summarises the top 10 players with the highest caps in the history of the World Cup, along with the country they represented, their positions, the clubs they played for, and their caps and goals at the most recent World Cup they participated in.
| country | pos | player | dob | caps | goals | club | year |
|---|---|---|---|---|---|---|---|
| US | MF | Kristine Lilly | 1971-07-22 | 338 | 128 | UNC | 2007 |
| US | DF | Christie Rampone | 1975-06-24 | 306 | 4 | Sky Blue | 2015 |
| Canada | FW | Christine Sinclair | 1983-06-12 | 282 | 181 | Portland Thorns | 2019 |
| US | FW | Carli Lloyd | 1982-07-16 | 271 | 107 | Sky Blue | 2019 |
| US | MF | Heather O’Reilly | 1985-01-02 | 219 | 41 | Kansas City | 2015 |
| Germany | FW | Birgit Prinz | 1977-10-25 | 211 | 128 | 1. FFC Frankfurt | 2011 |
| Sweden | MF | Therese Sjögran | 1977-04-08 | 209 | 21 | Rosengård | 2015 |
| Japan | MF | Homare Sawa | 1978-09-06 | 197 | 82 | INAC Kobe Leonessa | 2015 |
| Sweden | MF | Caroline Seger | 1985-03-19 | 193 | 27 | Rosengård | 2019 |
| Scotland | MF | Joanne Love | 1985-12-06 | 191 | 13 | Glasgow City | 2019 |
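A ranking like the one in the table above needs one row per player, taken from her most recent World Cup. A minimal base R sketch of that step is shown below; the data is hypothetical and matching players by name is an assumption.

```r
# Hypothetical toy data: some players appear in several tournaments.
squads <- data.frame(
  player = c("P1", "P1", "P2", "P3", "P3"),
  year   = c(2011, 2015, 2019, 2007, 2011),
  caps   = c(150, 180, 120, 90, 200),
  stringsAsFactors = FALSE
)

# Sort so that each player's most recent tournament comes first ...
by_recency <- squads[order(squads$player, -squads$year), ]

# ... then keep only the first (most recent) row for each player.
latest <- by_recency[!duplicated(by_recency$player), ]

# Rank by caps and keep the top 10.
top_caps <- head(latest[order(-latest$caps), ], 10)
top_caps
```

Because caps accumulate over a career, the most recent row is also the row with the player's highest cap count in the data.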
Again, the US took the largest share of the ranking table, with 4 players out of 10. Among them, Kristine Lilly held the highest record at the 2007 World Cup, her last World Cup before retirement. Since then, no player has been able to surpass her. Christie Rampone came closest, 32 caps behind.
Next we compare caps from country to country. For this task, we focus on only two years, 2015 and 2019, as these two tournaments had the same number of participating teams, which makes them easier to compare.
However, for 2019 we had to drop Nigeria and Cameroon from the list, as these two countries were missing a lot of information regarding caps and goals. The resulting figure is presented below:
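The selection described above, followed by a per-country summary, can be sketched in base R as follows. The data is hypothetical and the median is used here only as an example summary statistic.

```r
# Hypothetical toy data (illustrative values only).
squads <- data.frame(
  country = c("US", "US", "US", "US", "Nigeria", "Nigeria", "Nigeria"),
  year    = c(2015, 2015, 2019, 2019, 2015, 2019, 2019),
  caps    = c(100, 200, 50, 90, 30, NA, NA),
  stringsAsFactors = FALSE
)

# Keep 2015 and 2019, but drop the 2019 squads of Nigeria and Cameroon,
# whose caps are largely missing.
keep <- squads$year %in% c(2015, 2019) &
  !(squads$year == 2019 & squads$country %in% c("Nigeria", "Cameroon"))
compare <- squads[keep, ]

# Median caps per country and year; rows with missing caps are dropped.
median_caps <- aggregate(caps ~ country + year, data = compare, FUN = median)
median_caps
```

The 2015 rows for Nigeria and Cameroon are kept, so only the incomplete 2019 squads are excluded from the comparison.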
Figure 4.10: Comparison of cap bar by year
Surprisingly, the US team, even though still at the top, had lower caps in 2019 than in 2015, with only Carli Lloyd having caps high enough to be an outlier. Most of the celebrity teams (teams that had won or finished as runners-up), except for China PR and Brazil, saw their median caps drop. Expanding from the top 10 highest caps in table 4.2 above to the top 30, only 12 players in total participated in the most recent World Cup in 2019. Whilst we cannot calculate the impact of such drops on the teams’ performance (the US still won and the Netherlands was the runner-up in 2019), we can try to explain the reasons behind the drop in caps.
It could mean that the celebrity teams either (1) had their most experienced players reaching retirement age (if you revisit the top 10 table, you will see that most players in the top 10 were born in the late 70s and early 80s), so they had to drop out of the tournament, or (2) chose to save their ace players for other premier leagues rather than exhaust them in the World Cup. Whilst we can infer assumption (1) from the dates of birth provided, we have no solid basis for assumption (2). All we know at the moment is that FIFA has stepped in to encourage clubs to “collaborate” in exchange for a handsome payment in the form of support that helps the clubs build up their talent.
A bonus fun fact for this graph: 2019 was the first year Jamaica qualified for the World Cup, and the team consisted of players who had never played for a Jamaican club!
We will conclude this section by examining the relationship between the age, caps and goals of a World Cup player: whether one or more variables explain the others or, in this case, whether the caps and age of a player determine her ability to score more goals.
In this section we used the ggscatmat function to visualise the relationship between the variables in four different years, from 2007 to 2019. Again, years before 2007 were filtered out due to a lack of information.
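The pairwise correlations that such a scatterplot matrix reports can also be computed directly. The sketch below uses base R on hypothetical toy data and covers only the numeric part; it does not reproduce the ggscatmat plot itself.

```r
# Hypothetical toy data for two tournaments (illustrative values only).
players <- data.frame(
  year  = rep(c(2015, 2019), each = 4),
  age   = c(20, 24, 28, 32, 21, 25, 29, 33),
  caps  = c(5, 40, 90, 150, 10, 35, 80, 160),
  goals = c(0, 10, 30, 12, 1, 8, 25, 20)
)

# Split the three variables by year, then compute one correlation matrix per
# year; pairwise.complete.obs skips missing values pair by pair.
cor_by_year <- lapply(
  split(players[, c("age", "caps", "goals")], players$year),
  cor,
  use = "pairwise.complete.obs"
)

cor_by_year[["2019"]]
```

Each list element is a 3 x 3 matrix, so a single pair can be read with, for example, `cor_by_year[["2019"]]["age", "caps"]`.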
Figure 4.11: Relationship between ages, caps and goals
Figure 4.11 shows the density plot for each variable, the correlation for each pair of variables and the scatter plots. Some insights we can draw from the chart include:
In this exercise, we have learnt of a connection between domestic clubs and the performance of their nations on the international stage, which may appear counter-intuitive at first. The more players a country manages to attract to its domestic clubs, the better its performance at a tournament can be. Take the US as an example: according to FIFA, US clubs hosted the highest number of foreign players of all countries. Given that those foreign players go back to play for their own national teams, one might ask how that could benefit the performance of the US national team. Yet the US has won the World Cup half of the times it has been held, and finished as runner-up once more. The underlying point is that a country attractive enough to draw players from everywhere itself has the resources to invest in talent, which further strengthens its standing in any international competition.
Age also influences whether a player qualifies for the World Cup. On average, a World Cup player is not younger than her 20s, and the ideal age range is around 25 to 27 years. We also learnt that age and caps have a roughly positive relation: the older the player, the more international experience she most likely has. Unlike age and caps, there was no recognizable relation between age and goals. A higher age is not necessarily associated with more goals. In fact, most goals were scored by players in their 20s and 30s. The position of a player needs to be taken into consideration, rather than just goals, to gauge her performance.
Finally, from the data we have analysed, we formed the assumption that, by the most recent Women’s World Cup in 2019, some of the national teams either had their most experienced players reaching retirement age or that clubs might have been holding back from sending their ace players to the tournament. Whether the assumption holds or not, we only know that FIFA has responded by introducing a support package for the first time, under which clubs benefit from releasing their best players and allowing them to compete in the World Cup.
The packages used are as follows:
ggplot2 (Wickham 2016)
tidyverse (Wickham et al. 2019)
kableExtra (Zhu 2019)
bookdown (Xie 2020)
dplyr (Wickham et al. 2020)
plotly (Sievert 2020)
GGally (Schloerke et al. 2020)
Homewood, Brian. 2020. U.S. https://www.reuters.com/article/us-soccer-world-fifa/fifa-to-pay-clubs-209-million-for-world-cup-collaboration-idUSKBN0MG1I320150320.
Schacht, Kira. 2019. “World Cup Shows How Nations Back Women’s Soccer — or Don’t | Dw | 27.06.2019.” DW.COM. https://www.dw.com/en/world-cup-shows-how-nations-back-womens-soccer-or-dont/a-49359480.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2020. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.
SportMob. 2020. “Best Female Soccer Players of 2019.” SportMob. https://sportmob.com/en/article/152786-best-female-soccer-players-of-2019.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Xie, Yihui. 2020. Bookdown: Authoring Books and Technical Documents with R Markdown. https://github.com/rstudio/bookdown.
Zhu, Hao. 2019. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.